Project 3 -- PCA

Part 1: Getting started

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

data_raw = pd.read_csv(
    filepath_or_buffer='https://raw.githubusercontent.com/Sabaae/Dataset/main/domestic_product.csv',
    index_col=0
)

def plot_time_series(df, countries):
    fig, axes = plt.subplots(len(countries), 1, figsize=(10, len(countries) * 5))
    for i, country in enumerate(countries):
        country_data = df.loc[country]
        axes[i].plot(country_data.index, country_data.values)
        axes[i].set_title(f"Time-series for {country}")
        axes[i].set_xlabel("Year")
        axes[i].set_ylabel("Gross Domestic Product")
        years = country_data.index
        axes[i].set_xticks(years[::5])
    plt.show()

countries_list = [" Albania ", " Greece ", " Somalia ", " Sweden ", " Oman ", " Italy "]
print("Original Plots: ")
plot_time_series(data_raw, countries_list)

scaler = StandardScaler()
data_standardized = data_raw.copy()
data_standardized[:] = scaler.fit_transform(data_standardized)

print("Standardized Plots: ")
plot_time_series(data_standardized, countries_list)
Original Plots: 
Standardized Plots: 

The standardized time-series show each country's GDP relative to the others: each year's values are scaled to have a mean of 0 and a standard deviation of 1 across countries. Unlike the original plots, which show absolute GDP growth, the standardized curves reflect relative position and therefore fluctuate up and down. Most countries show a slightly decreasing trend, with lower values in 2020 than in 1970. Oman is an exception, rising from -0.191 in 1970 to -0.189 in 2020; this slight increase indicates that Oman's GDP grew faster relative to the other countries over the period. Conversely, a falling curve means a country's GDP grew more slowly than the others'. A positive value indicates that a country's GDP in a given year is above the mean GDP of all countries for that year, while a negative value indicates it is below the mean.
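The per-year standardization described above can be sketched on a toy matrix (the numbers here are made up, not the GDP data): `StandardScaler` standardizes each column independently, so a positive entry means "above that year's cross-country mean".

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: rows play the role of countries, columns the role of years
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# StandardScaler applies z = (x - column_mean) / column_std per column
Z = StandardScaler().fit_transform(X)

# Every column now has mean 0 and standard deviation 1
print(Z[:, 0])  # approximately [-1.2247, 0.0, 1.2247]
```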

Part 2: Applying PCA

  1. Compute the covariance matrix of the dataframe. Hint: the dimensions of your covariance matrix should be (52, 52).
  2. Write a function get_sorted_eigen(df_cov) that takes the covariance matrix of dataframe df (from step 1) and returns the eigenvalues and eigenvectors, sorted in decreasing order of eigenvalue, using np.linalg.eigh.
  3. Show the effectiveness of your principal components in covering the variance of the dataset with a scree plot.
  4. How many PCs do you need to cover 99.9% of the dataset's variance? The cumulative variance first reaches 99.9% at the fifth entry of the printed cumExpVar (99.91%). Therefore, 5 principal components are required to cover 99.9% of the dataset's variance.
  5. Plot the first 16 principal components (eigenvectors) as a time series (16 subplots; on the x-axis you have dates and on the y-axis you have the value of the PC element).
  6. Compare the first few PCs with the rest of them. Do you see any difference in their trend? The first few PCs show clear trends that reveal the major patterns in the data. PC 1 decreases steadily from 0.139 in 1970 to 0.064 in 2020. PC 2 increases from 0.003 to 0.032. PC 3 shows a slight overall increase with small fluctuations. In contrast, the remaining PCs (from the fourth onward) show no clear trend, moving up and down irregularly over the period. This difference means that the first few PCs capture the key trends over time, while the later PCs reflect smaller, less consistent variations.
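The threshold search in step 4 can also be done programmatically rather than by reading the printed array; a minimal sketch using `np.searchsorted` on a cumulative-variance array (the numbers below are illustrative, not the actual cumExpVar output):

```python
import numpy as np

# Illustrative cumulative explained-variance percentages (not the real output)
cum_exp_var = np.array([95.3, 99.0, 99.7, 99.84, 99.91, 99.95])

# searchsorted returns the first index at which the cumulative variance
# reaches (or exceeds) 99.9%; +1 because components are 1-indexed
k = np.searchsorted(cum_exp_var, 99.9) + 1
print(k)  # 5
```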
In [ ]:
import numpy as np

# 1. The covariance matrix of the dataframe
df_cov = np.cov(data_standardized.T)
print(df_cov.shape)

# 2. Function that gets the covariance matrix of df, returns sorted eigenvalues and eigenvectors
def get_sorted_eigen(df_cov):
    eigenValues, eigenVectors = np.linalg.eigh(df_cov)

    # sorting in decreasing order
    args = (-eigenValues).argsort()
    eigenValues = eigenValues[args]
    eigenVectors = eigenVectors[:, args]

    return eigenValues, eigenVectors

# 3. Show effectiveness of principal components in covering the variance
eigenValues, eigenVectors = get_sorted_eigen(df_cov)
eigValSum = sum(eigenValues)
expVar = [eigV/eigValSum*100 for eigV in eigenValues]
cumExpVar = np.cumsum(expVar)
print(cumExpVar)

plt.bar(range(1, 11), expVar[:10], label='Explained Variance')
plt.plot(range(1, 11), cumExpVar[:10], 'r-o', label='Cumulative Explained Variance')
plt.xlabel('Principal component')
plt.ylabel('Explained variance (%)')
plt.legend()
plt.show()

# 5. Plot the first 16 PCs as a time series
years = data_standardized.columns
fig, axes = plt.subplots(16, 1, figsize=(10, 16 * 3))
for i in range(16):
    pc_data = eigenVectors[:, i]
    axes[i].plot(years, pc_data)
    axes[i].set_title(f"Time-series for PC {i+1}")
    axes[i].set_xlabel("Year")
    axes[i].set_ylabel("PC Value")
    axes[i].set_xticks(years[::5])
plt.tight_layout()
plt.show()
(52, 52)
[ 95.31499014  98.95623653  99.67607837  99.83935344  99.90633114
  99.94418134  99.96358004  99.97863499  99.98407348  99.98819769
  99.99105253  99.99334173  99.99490079  99.99604911  99.99703884
  99.99751032  99.9979238   99.99829534  99.99860684  99.9988423
  99.99903082  99.99921154  99.99936265  99.99947642  99.99957731
  99.99965223  99.99971498  99.9997618   99.99980548  99.99984546
  99.99987235  99.99989208  99.9999105   99.99992477  99.99993781
  99.999947    99.99995571  99.9999625   99.99996881  99.99997425
  99.99997844  99.99998221  99.99998542  99.99998845  99.99999124
  99.99999335  99.999995    99.9999964   99.9999977   99.999999
  99.99999962 100.        ]

Part 3: Data reconstruction

Create a function that:

  • Accepts a country and the original dataset as inputs.
  • Calls useful functions that you designed in previous parts to compute eigenvectors and eigenvalues.
  • Plots 4 figures:
    1. The original time-series for the specified country.
    2. The incremental reconstruction of the original (not standardized) time-series for the specified country in a single plot.
    • You should at least show 5 curves in a figure for incremental reconstruction. For example, you can pick the following (or any other combination that you think is reasonable):

      • Reconstruction with only PC1
      • Reconstruction with both PC1 and PC2
      • Reconstruction with PC1 to PC4 (First 4 PCs)
      • Reconstruction with PC1 to PC8 (First 8 PCs)
      • Reconstruction with PC1 to PC16 (First 16 PCs)
    • Hint: you need to compute the reconstruction for the standardized time-series first, and then scale it back to the original (non-standardized form) using the StandardScaler inverse_transform help...

    3. The residual error for your best reconstruction with respect to the original time-series.
    • Hint: You are plotting the error that we have for reconstructing each year (df - df_reconstructed). On the x-axis, you have dates, and on the y-axis, the residual error.
    4. The RMSE of the reconstruction as a function of the number of included components (x-axis is the number of components; y-axis is the RMSE). Sweep the x-axis from 1 to 10 (this part is independent of part 3.2).

Test your function using the three countries you selected in Part 1.3 plus Greece, Somalia, and Italy as inputs.
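The projection-and-reconstruction step above can be checked on synthetic data; a sketch (random matrix, names of my own choosing) showing that projecting onto the first k eigenvectors and mapping back gives an increasingly accurate reconstruction as k grows:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))          # 30 samples, 8 features
Xc = X - X.mean(axis=0)               # center before PCA

# Eigen-decomposition of the covariance matrix, sorted by decreasing eigenvalue
vals, vecs = np.linalg.eigh(np.cov(Xc.T))
vecs = vecs[:, (-vals).argsort()]

def rmse_with_k(k):
    # Project onto the first k eigenvectors, then map back
    recon = Xc @ vecs[:, :k] @ vecs[:, :k].T
    return np.sqrt(np.mean((Xc - recon) ** 2))

errors = [rmse_with_k(k) for k in range(1, 9)]
# RMSE is non-increasing in k and numerically zero at full rank
print(errors[0], errors[-1])
```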

In [ ]:
from sklearn.metrics import mean_squared_error

def compute_recon(X_std, PC_count, eigenVectors):
    projX = np.dot(X_std, eigenVectors[:, 0:PC_count])
    reconX = np.dot(projX, eigenVectors[:, 0:PC_count].T)
    return reconX

def plot_country_figures(original_df, country_name):
    country_data = original_df.loc[country_name]
    country_data_std = scaler.transform(country_data.values.reshape(1, -1))
    years = country_data.index

    # Figure 1: original time-series for the specified country
    plt.figure(figsize=(10, 5))
    plt.plot(years, country_data.values)
    plt.title(f"Original Time-series for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Gross Domestic Product")
    plt.xticks(years[::5])
    plt.show()

    # Figure 2: incremental reconstruction of the original time-series
    PC_counts = [1, 2, 4, 8, 16]
    plt.figure(figsize=(10, 5))
    for PC_count in PC_counts:
        recon_std = compute_recon(country_data_std, PC_count, eigenVectors)
        recon_original = scaler.inverse_transform(recon_std.reshape(1, -1)).flatten()
        plt.plot(original_df.columns, recon_original, label=f"Reconstruction with {PC_count} PCs")
    #plt.plot(original_df.columns, country_data, label="Original Data")
    plt.title(f"Incremental Reconstruction of Original Time-series for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Gross Domestic Product")
    plt.legend()
    plt.xticks(years[::5])
    plt.show()

    # Figure 3: residual error for best reconstruction (PC=16)
    best_recon_std = compute_recon(country_data_std, 16, eigenVectors)
    best_recon_original = scaler.inverse_transform(best_recon_std.reshape(1, -1)).flatten()
    residual_err = country_data - best_recon_original
    plt.figure(figsize=(10, 5))
    plt.plot(original_df.columns, residual_err)
    plt.title(f"Residual Error for Best Reconstruction for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Residual Error")
    plt.xticks(years[::5])
    plt.show()

    # Figure 4: RMSE of the reconstruction as a function of the number of included components
    rmse_list = []
    for i in range(1, 11):
        recon_std = compute_recon(country_data_std, i, eigenVectors)
        recon_original = scaler.inverse_transform(recon_std.reshape(1, -1)).flatten()
        rmse = np.sqrt(mean_squared_error(country_data, recon_original))
        rmse_list.append(rmse)

    plt.figure(figsize=(10, 5))
    plt.plot(range(1, 11), rmse_list)
    plt.title(f"RMSE of Reconstruction for {country_name}")
    plt.xlabel("Number of Principal Components")
    plt.ylabel("RMSE")
    plt.xticks(range(1, 11))
    plt.show()

# Plot figures
new_countries = [" Greece ", " Somalia ", " Italy ", " Viet Nam ", " Zambia ", " Zimbabwe "]
for country in new_countries:
    plot_country_figures(data_raw, country)
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
  warnings.warn(

Part 4: SVD

Modify your code in part 3 to use SVD instead of PCA.

Explain if standardization or covariance computation is required for this part. Repeat part 3 and compare your PCA and SVD results. Write a function to make this comparison, and comment on the results.

SVD does not require the covariance matrix: it factorizes the data matrix itself into singular values and singular vectors, so it can be applied directly to the raw data. This makes SVD suitable for dimensionality reduction without needing standardization or covariance computation, although on non-centered data the first component largely captures the overall scale of the values rather than the direction of maximum variance.

The three plots compare PCA and SVD applied to the standardized data. In the first plot, both the PCA and SVD reconstructions closely match the actual GDP values over time, indicating that both methods capture the main trends in the data. The two curves overlap, showing that the reconstructions are identical.

In the second plot that displays the residual errors, the curves for PCA and SVD are also similar. This suggests that both methods approximate the data equally well, with the same level of accuracy in capturing the fluctuations in GDP.

The third plot demonstrates RMSE of the reconstructions as the number of principal components increases. Both PCA and SVD illustrate a similar decline in RMSE with more components, indicating that both methods improve in accuracy as more components are added.

The same result is expected because the data are scaled so that each feature has a mean of 0 and a standard deviation of 1. In this case, applying SVD to the standardized data performs essentially the same computation as PCA: the right singular vectors of the standardized data matrix are the eigenvectors of its covariance matrix (and the squared singular values are proportional to the eigenvalues), leading to identical reconstructions.
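This equivalence can be verified numerically: for zero-mean columns, the right singular vectors of X match the eigenvectors of its covariance matrix up to sign. A minimal sketch on random standardized data (not the GDP dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns

# PCA route: eigenvectors of the covariance matrix, sorted by eigenvalue
vals, vecs = np.linalg.eigh(np.cov(X.T))
vecs = vecs[:, (-vals).argsort()]

# SVD route: right singular vectors of the data matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Each column agrees up to an arbitrary sign flip
for i in range(V.shape[1]):
    assert np.allclose(np.abs(vecs[:, i]), np.abs(V[:, i]), atol=1e-8)
```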

In [ ]:
from numpy.linalg import svd

def plot_country_figures(original_df, country_name):
    country_data = original_df.loc[country_name]
    years = country_data.index

    # apply SVD
    U, S, Vt = svd(original_df, full_matrices=False)
    V = Vt.T

    # Figure 1: original time-series for the specified country
    plt.figure(figsize=(10, 5))
    plt.plot(years, country_data.values)
    plt.title(f"Original Time-series for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Gross Domestic Product")
    plt.xticks(years[::5])
    plt.show()

    # Figure 2: incremental reconstruction of the original time-series
    PC_counts = [1, 2, 4, 8, 16]
    plt.figure(figsize=(10, 5))
    for PC_count in PC_counts:
        recon = compute_recon(country_data.values.reshape(1, -1), PC_count, V)
        plt.plot(original_df.columns, recon.flatten(), label=f"Reconstruction with {PC_count} PCs")

    plt.title(f"Incremental Reconstruction of Original Time-series for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Gross Domestic Product")
    plt.legend()
    plt.xticks(years[::5])
    plt.show()

    # Figure 3: residual error for best reconstruction (PC=16)
    best_recon = compute_recon(country_data.values.reshape(1, -1), 16, V)
    residual_err = country_data.values - best_recon.flatten()
    plt.figure(figsize=(10, 5))
    plt.plot(original_df.columns, residual_err)
    plt.title(f"Residual Error for Best Reconstruction for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Residual Error")
    plt.xticks(years[::5])
    plt.show()

    # Figure 4: RMSE of the reconstruction as a function of the number of included components
    rmse_list = []
    for i in range(1, 11):
        recon = compute_recon(country_data.values.reshape(1, -1), i, V)
        rmse = np.sqrt(mean_squared_error(country_data.values, recon.flatten()))
        rmse_list.append(rmse)

    plt.figure(figsize=(10, 5))
    plt.plot(range(1, 11), rmse_list)
    plt.title(f"RMSE of Reconstruction for {country_name}")
    plt.xlabel("Number of Principal Components")
    plt.ylabel("RMSE")
    plt.xticks(range(1, 11))
    plt.show()

# Plot figures
for country in new_countries:
    plot_country_figures(data_raw, country)
In [ ]:
def plot_pca_svd_comparison(original_df, country_name):
    country_data = original_df.loc[country_name]
    # transform with the scaler already fitted on the full dataset
    # (per-year statistics), instead of refitting it on a single country
    country_data_std = scaler.transform(country_data.values.reshape(1, -1))
    years = country_data.index

    # apply standardization on svd for comparison
    U, S, Vt = svd(data_standardized, full_matrices=False)
    V = Vt.T

    # plot best reconstruction pca vs svd (PC=16)
    best_recon_pca_std = compute_recon(country_data_std, 16, eigenVectors)
    best_recon_pca = scaler.inverse_transform(best_recon_pca_std).flatten()
    best_recon_svd_std = compute_recon(country_data_std, 16, V)
    best_recon_svd = scaler.inverse_transform(best_recon_svd_std).flatten()

    plt.figure(figsize=(10, 5))
    plt.plot(years, best_recon_pca, label="PCA Reconstruction", linestyle="--")
    plt.plot(years, best_recon_svd, label="SVD Reconstruction", linestyle=":")
    #plt.plot(years, country_data.values, label="Original Data")
    plt.title(f"Best Reconstruction PCA vs SVD for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Gross Domestic Product")
    plt.legend()
    plt.xticks(years[::5])
    plt.show()

    # plot residual errors for best reconstruction for pca vs svd
    re_pca = country_data.values - best_recon_pca
    re_svd = country_data.values - best_recon_svd

    plt.figure(figsize=(10, 5))
    plt.plot(years, re_pca, label="PCA Residual", linestyle="--")
    plt.plot(years, re_svd, label="SVD Residual", linestyle=":")
    plt.title(f"Residual Errors for Best Reconstruction for PCA vs SVD for {country_name}")
    plt.xlabel("Year")
    plt.ylabel("Residual Error")
    plt.legend()
    plt.xticks(years[::5])
    plt.show()

    # plot RMSE of reconstruction for pca vs svd
    rmse_pca = []
    rmse_svd = []
    for i in range(1, 11):
        recon_pca_std = compute_recon(country_data_std, i, eigenVectors)
        recon_pca = scaler.inverse_transform(recon_pca_std).flatten()
        rmse_pca.append(np.sqrt(mean_squared_error(country_data.values, recon_pca)))

        recon_svd_std = compute_recon(country_data_std, i, V)
        recon_svd = scaler.inverse_transform(recon_svd_std).flatten()
        rmse_svd.append(np.sqrt(mean_squared_error(country_data.values, recon_svd)))

    plt.figure(figsize=(10, 5))
    plt.plot(range(1, 11), rmse_pca, label="PCA RMSE", linestyle="--")
    plt.plot(range(1, 11), rmse_svd, label="SVD RMSE", linestyle=":")
    plt.title(f"RMSE of Reconstruction for PCA vs SVD for {country_name}")
    plt.xlabel("Number of Principal Components")
    plt.ylabel("RMSE")
    plt.legend()
    plt.xticks(range(1, 11))
    plt.show()

# plot comparison figures
for country in new_countries:
    plot_pca_svd_comparison(data_raw, country)

Part 5: Let's collect another dataset!

Create another dataset similar to the dataset in part 1, this time using the raw information on average monthly electricity sales across various states of the United States from 2001 to 2024 here.

You need to manipulate the data to organize it in the desired format (i.e., the same format that was in previous parts). Missing values were removed such that if there was a missing value for the average electricity sales of a particular state at a given date, that date has been completely removed from the dataset, even if the data of that specific date existed for other states.

You are free to use any tools you like, from Excel to Python! In the end, you should have a new CSV file similar to the previous dataset. How many features does the final dataset have? How many cities are there?

There are 277 features and 62 unique cities in train.csv.
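The reshaping and missing-value handling described above can be sketched in pandas. The column names and values below are hypothetical placeholders, not the actual Kaggle schema: the idea is to pivot the long table into the Part 1 layout and then drop any date missing for at least one state.

```python
import pandas as pd

# Hypothetical long-format file: one row per (state, date) observation
raw = pd.DataFrame({
    "state": ["NY", "NY", "UT", "UT", "WY"],
    "date":  ["2001-01", "2001-02", "2001-01", "2001-02", "2001-01"],
    "sales": [10.0, 11.0, 3.0, 3.5, 1.0],
})

# Pivot to the Part 1 layout: rows = states, columns = dates
wide = raw.pivot(index="state", columns="date", values="sales")

# Drop any date that is missing for at least one state
wide = wide.dropna(axis=1, how="any")
print(wide.shape)  # WY has no 2001-02 value, so that date is dropped entirely
```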

Upload your new dataset (in CSV format) to your colab notebook, repeat part 4 for this dataset, and comment on the results. When analyzing the states, use New York, Utah, and three other states with the closest alphabetical names to your first name.

The best-reconstruction plot for PCA vs SVD illustrates the reconstructed average electricity sales over time. Both the PCA and SVD curves rise and fall repeatedly and overlap, meaning they effectively capture the cyclical patterns in the data, particularly the seasonal spikes in specific months. These repeating peaks imply a seasonal trend in electricity usage, likely aligned with weather changes.

The residual error plot emphasizes the consistency between PCA and SVD, as the two curves are identical. Although there are periodic errors, possibly seasonal, both methods struggle with them equally, confirming their similar performance in capturing the underlying sales patterns. Overall, the plots highlight the regular seasonal fluctuations in each state's electricity data.

For the original data, the RMSE under SVD decreases as more components capture meaningful variance, improving accuracy. For the standardized data, the RMSE increases with more components; this may be because equal feature weighting lets later components fit noise. The trend on the original data is typically more reliable, reflecting the true variance.

The code below helps you to upload your new CSV file to your colab session.

In [ ]:
# load train.csv to Google Colab
from google.colab import files
uploaded = files.upload()
Saving train.csv to train.csv
In [ ]:
city_data = pd.read_csv("train.csv", index_col=0)

def plot_city_figures(original_df, city_name):
    city_data = original_df.loc[city_name]
    years = city_data.index

    # apply SVD
    U, S, Vt = svd(original_df, full_matrices=False)
    V = Vt.T

    # Figure 1: original time-series for the specified city
    plt.figure(figsize=(20, 5))
    plt.plot(years, city_data.values)
    plt.title(f"Original Time-series for {city_name}")
    plt.xlabel("Year Month")
    plt.ylabel("Average Electricity Sales")
    plt.xticks(years[::12], rotation=45)
    plt.show()

    # Figure 2: incremental reconstruction of the original time-series
    PC_counts = [1, 2, 4, 8, 16]
    plt.figure(figsize=(20, 5))
    for PC_count in PC_counts:
        recon = compute_recon(city_data.values.reshape(1, -1), PC_count, V)
        plt.plot(original_df.columns, recon.flatten(), label=f"Reconstruction with {PC_count} PCs")

    plt.title(f"Incremental Reconstruction of Original Time-series for {city_name}")
    plt.xlabel("Year Month")
    plt.ylabel("Average Electricity Sales")
    plt.legend()
    plt.xticks(years[::12], rotation=45)
    plt.show()

    # Figure 3: residual error for best reconstruction (PC=16)
    best_recon = compute_recon(city_data.values.reshape(1, -1), 16, V)
    residual_err = city_data.values - best_recon.flatten()
    plt.figure(figsize=(20, 5))
    plt.plot(original_df.columns, residual_err)
    plt.title(f"Residual Error for Best Reconstruction for {city_name}")
    plt.xlabel("Year Month")
    plt.ylabel("Residual Error")
    plt.xticks(years[::12], rotation=45)
    plt.show()

    # Figure 4: RMSE of the reconstruction as a function of the number of included components
    rmse_list = []
    for i in range(1, 11):
        recon = compute_recon(city_data.values.reshape(1, -1), i, V)
        rmse = np.sqrt(mean_squared_error(city_data.values, recon.flatten()))
        rmse_list.append(rmse)

    plt.figure(figsize=(10, 5))
    plt.plot(range(1, 11), rmse_list)
    plt.title(f"RMSE of Reconstruction for {city_name}")
    plt.xlabel("Number of Principal Components")
    plt.ylabel("RMSE")
    plt.xticks(range(1, 11))
    plt.show()

# Plot figures
cities = ["New York", "Utah", "West Virginia", "Wisconsin", "Wyoming"]
for city in cities:
    plot_city_figures(city_data, city)
In [ ]:
def plot_pca_svd_comparison(original_df, city_name):
    city_data = original_df.loc[city_name]
    years = city_data.index

    # standardize the full dataset first, then transform this city's row
    # with the same per-column statistics (instead of refitting the scaler
    # on a single city)
    data_standardized = original_df.copy()
    data_standardized[:] = scaler.fit_transform(data_standardized)
    city_data_std = scaler.transform(city_data.values.reshape(1, -1))

    # apply SVD to the standardized data for comparison
    U, S, Vt = svd(data_standardized, full_matrices=False)
    V = Vt.T

    # plot best reconstruction pca vs svd (PC=16)
    df_cov = np.cov(data_standardized.T)
    eigenValues, eigenVectors = get_sorted_eigen(df_cov)

    best_recon_pca_std = compute_recon(city_data_std, 16, eigenVectors)
    best_recon_pca = scaler.inverse_transform(best_recon_pca_std).flatten()
    best_recon_svd_std = compute_recon(city_data_std, 16, V)
    best_recon_svd = scaler.inverse_transform(best_recon_svd_std).flatten()

    plt.figure(figsize=(20, 5))
    plt.plot(years, best_recon_pca, label="PCA Reconstruction", linestyle="--")
    plt.plot(years, best_recon_svd, label="SVD Reconstruction", linestyle=":")
    #plt.plot(years, city_data.values, label="Original Data")
    plt.title(f"Best Reconstruction PCA vs SVD for {city_name}")
    plt.xlabel("Year Month")
    plt.ylabel("Average Electricity Sales")
    plt.legend()
    plt.xticks(years[::12], rotation=45)
    plt.show()

    # plot residual errors for best reconstruction for pca vs svd
    re_pca = city_data.values - best_recon_pca
    re_svd = city_data.values - best_recon_svd

    plt.figure(figsize=(20, 5))
    plt.plot(years, re_pca, label="PCA Residual", linestyle="--")
    plt.plot(years, re_svd, label="SVD Residual", linestyle=":")
    plt.title(f"Residual Errors for Best Reconstruction for PCA vs SVD for {city_name}")
    plt.xlabel("Year Month")
    plt.ylabel("Residual Error")
    plt.legend()
    plt.xticks(years[::12], rotation=45)
    plt.show()

    # plot RMSE of reconstruction for pca vs svd
    rmse_pca = []
    rmse_svd = []
    for i in range(1, 11):
        recon_pca_std = compute_recon(city_data_std, i, eigenVectors)
        recon_pca = scaler.inverse_transform(recon_pca_std).flatten()
        rmse_pca.append(np.sqrt(mean_squared_error(city_data.values, recon_pca)))

        recon_svd_std = compute_recon(city_data_std, i, V)
        recon_svd = scaler.inverse_transform(recon_svd_std).flatten()
        rmse_svd.append(np.sqrt(mean_squared_error(city_data.values, recon_svd)))

    plt.figure(figsize=(10, 5))
    plt.plot(range(1, 11), rmse_pca, label="PCA RMSE", linestyle="--")
    plt.plot(range(1, 11), rmse_svd, label="SVD RMSE", linestyle=":")
    plt.title(f"RMSE of Reconstruction for PCA vs SVD for {city_name}")
    plt.xlabel("Number of Principal Components")
    plt.ylabel("RMSE")
    plt.legend()
    plt.xticks(range(1, 11))
    plt.show()

# plot comparison figures
for city in cities:
    plot_pca_svd_comparison(city_data, city)

References

Understanding PCA and SVD:

  1. https://towardsdatascience.com/pca-and-svd-explained-with-numpy-5d13b0d2a4d8

  2. https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca

  3. https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

  4. https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.8-Singular-Value-Decomposition/

PCA:

  1. Snippets from: https://plot.ly/ipython-notebooks/principal-component-analysis/

  2. https://www.value-at-risk.net/principal-component-analysis/

U.S. Electricity Prices Data:

  1. https://www.kaggle.com/datasets/alistairking/electricity-prices